Abstract

We present a proof that in a fat-tree network with n processing nodes, m ≤ n messages with
randomly chosen, distinct sources and independently and randomly chosen destinations are
delivered within O(lg m) delivery rounds with high probability. More precisely, we establish
that m messages are delivered in O(lg m + ln 1/ε) delivery rounds with probability 1 − ε for any
small ε > 0. Unlike previously applied proof methods, we use an approximating model for the
collision behavior of the network amenable to concise yet simple theoretical analysis. We justify 
the accuracy of the approximation by means of behavioral simulations based on a gate-level 
implementation of a fat-tree network. 

1 Introduction 

Fat-tree networks are established as area-universal communication networks due to the seminal 
work of Charles E. Leiserson [8, 3], culminating in the implementation of the Connection Machine 
CM-5 at Thinking Machines Corporation [9]. Today, advances in semiconductor technology enable 
us to integrate multiprocessor machines on a single chip, as explored in the Raw project [12], for 
example. As the number of processors on a chip increases, employing one or more fat-tree networks 
as interconnection medium is an attractive design alternative. 

The theoretical properties of fat-tree networks constitute a compelling reason to consider them 
for single-chip multiprocessors. In this article, we reevaluate the theoretical performance of a 
fat-tree network with respect to delivery times of messages. We present a proof that m ≤ n
messages with randomly chosen sources and destinations can be delivered in a fat-tree network
with n processing nodes within O(lg m) delivery rounds with high probability. Our result improves
on previously published bounds, which depend on the number of processing nodes n rather than the
number of messages m. Leiserson [8] derived a bound using the load factor λ of a set of messages. He
showed that the number of delivery rounds required to deliver a set of messages whose sources
and destinations are known in advance is O(λ lg n). Greenberg and Leiserson [3] derived a
bound of O(λ + lg n lg lg n) for the number of delivery rounds when the sources and destinations of
messages are unknown, assuming, however, that the probability of congesting a channel follows the
binomial distribution.

Empirical evidence shows that these bounds are conservative. To prove our tighter bound, 
we develop a model for the collision behavior of messages. Since this model merely approximates 
the actual occurrence of collisions, we present empirical evidence that it reflects reality accurately
enough to justify our time bound. We have developed a gate-level implementation of the fat-tree
network and a behavioral simulator that permits us to scale our simulations up to large numbers
of processing nodes. Our simulations show that O(lg m) is not only an upper bound for randomly
chosen message sources and destinations, but for many regular communication patterns as well. 

Our goal is to derive a suitable model of the collision behavior of a fat tree that approximates 
reality with sufficient accuracy and permits a concise yet simple theoretical analysis at the same 
time. Previous work, such as [3, 1, 4, 5, 6, 7, 10], suggests that an exact analysis requires significant 
theoretical armory. While most of this work tackles more general routing problems, we are not aware 
of any approaches with a goal similar to ours. 

2 Proof Outline 

Our proof is based on the structural analysis of a particular fat-tree network architecture, which
yields the average probability Pr[C_2] of a collision of two messages with randomly chosen sources
and destinations. This probability embodies the structure of the fat tree.

We then model the collision behavior of m > 2 messages by means of an approximating balls- 
and-bins game. The simple balls-and-bins game neglects probabilistic dependencies. Nevertheless, 
in Section 5 we show empirically that neglecting dependencies due to the random selection of 
message destinations affects the result by a small constant factor only. We calibrate the number 
of collision bins to reflect the probability Pr[C_2]. Messages correspond to balls tossed into collision
bins. A message may be rejected or delivered depending on the outcome of the collision toss. 
Rejected messages must be retried, leading to a model of subsequent delivery rounds that correspond 
to a sequence of collision tosses. 

We prove the result in two phases, depending on whether the number of messages m is larger 
than the number of collision bins b or not. We assume that all messages rejected in one delivery 
round are retried during the subsequent delivery round. In phase I we prove that the number of 
messages delivered per delivery round for m > b is larger than a constant amount with at least 
constant probability. In phase II we prove that the fraction of messages delivered per delivery round 
for m < b is larger than a constant amount with at least constant probability. In both phases, 
the expected number of delivery rounds is O(lg m). Finally, we use a Chernoff bound to establish
the high-probability result for each phase that m messages are delivered within O(lg m + ln 1/ε)
delivery rounds with probability 1 − ε for any small ε > 0.

3 Fat-Tree Architecture 

Our proof is restricted to the architecture of the fat-tree network shown in Figure 1 with the router
design described below.^1 Whether our proof methodology is applicable to leaner trees or even
entirely different network architectures remains an open question.

We introduce the following design decisions. The network shall be circuit-switched, where
messages reserve a path from the source to the destination on their way through the network. 
In contrast to packet routing, this design is particularly suited for pipelining streams of data 
through an array of processors with register-mapped networks. Applications such as digital signal 
processing would be a primary beneficiary of this design choice. In a circuit-switched network, an
explicit acknowledgment signal is used to release the resources of the reserved path. Consequently,
no buffering is needed at the router nodes other than a small, constant number of pipeline registers.
Furthermore, each processing node may have at most one message transmission and one, potentially
simultaneous, message reception in progress. Each of the links in Figure 1 is a bidirectional link,
or full-duplex link, consisting of two sets of wires, each responsible for transmitting signals in one
direction. Each router of the network has four ports a, b, c, and d, and each port has an incoming
and an outgoing set of wires, as shown in Figure 2. We call ports a and b the downstream ports,
and ports c and d the upstream ports, as is evident from Figure 1.

Figure 1: Structure of fat-tree network with 16 processing nodes.

^1 The fat-tree network under investigation is similar to, yet different from, a back-to-back butterfly or Benes network,
because of its connections between the downstream ports. The fat-tree network is relatively easy to implement and
realizable with today's and future micro-technology, which offers six or more levels of metal.



Figure 2: Router design of a full-duplex fat tree. 



Each router is designed to transmit and reject messages according to the following behavior. 
Upstream messages arrive on one of the downstream ports a or b, and are transmitted through 
one of the upstream ports c or d at random. This upstream port selection is the only source of
randomness in the routing process. If one upstream port is in use when a second upstream message
arrives, the available port is assigned to the second message deterministically. Downstream
messages are transmitted through one of the downstream ports a or b. Since the downstream
ports a and b each have only one set of outgoing wires, contention may occur if more
than one message is to be transmitted through one of these downstream ports. For example, if
two downstream messages arrive on ports c and d, and both are to be transmitted through port a,
only one of them may use the outgoing wires of port a. The other message is rejected, that is, a
collision signal is sent to the sender for notification. The sender is responsible for initiating a retry.

The collision behavior of our router design obeys the simple message rejection rule: all 
but one of the downstream messages with the same outgoing port are rejected. Messages can 
collide only while traveling downstream. There exist two characteristic collision scenarios. (1) Two
downstream messages arrive at the upstream ports to be transmitted through the same downstream
port. (2) One downstream message arrives at one of the downstream ports, another downstream
message at one of the upstream ports, and both are to be transmitted through the same downstream
port. In both scenarios, one message is rejected, and the other passes successfully. If both scenarios
happen simultaneously, that is, three downstream messages arrive on one downstream port and both
upstream ports, then two messages are rejected, and one passes through the router.
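
The rejection rule can be illustrated with a minimal Python sketch. The function name `arbitrate`, the representation of a downstream message as an (input port, requested output port) pair, and the choice of the first listed message as the survivor are illustrative assumptions; the text does not specify which of the contending messages passes.

```python
from collections import defaultdict

def arbitrate(requests):
    """Apply the rejection rule to one router for one delivery round.

    requests: list of (input_port, output_port) pairs for downstream
    messages, where output_port is 'a' or 'b'.
    Returns (passed, rejected) as lists of the same pairs.
    Assumption: the first message listed for a port is the one that passes.
    """
    by_output = defaultdict(list)
    for req in requests:
        by_output[req[1]].append(req)
    passed, rejected = [], []
    for contenders in by_output.values():
        passed.append(contenders[0])      # exactly one message per outgoing port passes
        rejected.extend(contenders[1:])   # all others are rejected
    return passed, rejected

# Scenario (1): messages from upstream ports c and d both request port a.
print(arbitrate([("c", "a"), ("d", "a")]))
# Both scenarios at once: three messages request port a.
print(arbitrate([("b", "a"), ("c", "a"), ("d", "a")]))
```

In the three-message case the sketch rejects two messages and passes one, in line with the combined scenario described above.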

Note that messages cannot collide while traveling upstream, because the network
architecture doubles the number of wires at each level of router nodes from the leaves towards
the root. Therefore, we do not have to be concerned about contention on the upstream paths of 
messages, even if each processing node injects a message into the tree. 

We introduce the following naming scheme for the network routers. We call a router at level l
in the tree a level-l router or L_l-router. An L_0-router is a leaf node of the tree, connecting two
processing nodes. A router node at level l in the fat tree consists of 2^l individual L_l-routers. In
Figure 1, a router node is shown as a rectangle if it comprises more than one router. Furthermore,
we have annotated one router node at each level in the tree with the corresponding levels L_0, L_1,
L_2, and L_3.
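
As a quick consistency check of the naming scheme, the following Python sketch tabulates the router nodes per level for a tree with n processing nodes. It assumes a complete binary arrangement of router nodes (n/2^(l+1) router nodes at level l), which matches Figure 1 for n = 16 but is an inference rather than a statement from the text; the function name `level_table` is hypothetical.

```python
def level_table(n):
    """Tabulate router nodes per level for n = 2^p processing nodes.

    Assumption: router nodes form a complete binary tree, so level l
    holds n / 2**(l+1) router nodes, each made of 2**l routers with
    two upstream ports (c and d) apiece.
    """
    levels = n.bit_length() - 1                 # lg n for a power of two
    rows = []
    for l in range(levels):
        router_nodes = n // 2 ** (l + 1)
        routers_per_node = 2 ** l
        upstream_ports = 2 * routers_per_node   # ports c and d of each router
        nodes_below = 2 ** (l + 1)              # processing nodes under one router node
        rows.append((l, router_nodes, routers_per_node, upstream_ports, nodes_below))
    return rows

for row in level_table(16):
    print("L%d: %d router nodes x %d routers, %d upstream ports for %d processing nodes" % row)
```

Under these assumptions every router node offers as many upstream ports as there are processing nodes beneath it, which is consistent with the observation that upstream messages never contend for wires.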

4 Structural Analysis 

We analyze the collision behavior of a fat-tree network to compute the probability of a collision 
between two messages. 

Lemma 1 (2-Message Collision Probability) Two messages with randomly chosen, distinct 
sources and independently and randomly chosen destinations collide on average with probability 

$$\Pr[C_2] = \Theta\!\left(\frac{\lg n}{n}\right)$$

in the fat tree with n processing nodes described in Section 3. Moreover, Pr[C_2] can be bounded as
follows for n > 0:

$$\frac{\lg n}{3n} \le \Pr[C_2] \le \frac{\lg n}{2n}.$$

Proof: We employ an accounting argument of basic collision events covering the entire sample
space of possible collisions. We fix the source of message m_1 at node 0 of the fat tree without
loss of generality. This gives us a choice of n − 1 destinations for m_1, n − 1 possible sources of
message m_2, and n − 1 possible destinations for m_2. Hence, our sample space comprises (n − 1)^3
distinct elementary events.

We utilize the symmetry of the fat tree to account for entire subtrees at a time. In particular,
we consider ν-subtrees with 2^ν nodes and denote as Pr[k, i, j] the probability that m_1 with source
node 0 and its destination node in the i-subtree collides with m_2 with its source node in the k-subtree
and its destination node in the j-subtree. The subtrees are uniquely specified such that
all nodes in a ν-subtree have the same dilation 2(ν + 1), counted in number of links, from the
respective reference node.

The destination subtrees of m_1 are the i-subtrees. Since the source of m_1 is fixed at node 0,
we can easily identify the i-subtrees with respect to node 0. For i = 0, the only node with dilation
2(0 + 1) = 2 is node 1; cf. Figure 1. Thus, the (i = 0)-subtree is {1}. For i = 1, the nodes
with dilation 2(1 + 1) = 4 are 2 and 3. Therefore, the (i = 1)-subtree is {2, 3}. Analogously, the
(i = 2)-subtree is {4, 5, 6, 7}, the (i = 3)-subtree is {8, ..., 15}, etc. We observe that, in general,
the (i = ν)-subtree is the set of nodes {2^ν, ..., 2^{ν+1} − 1}.



The k-subtrees contain the possible source nodes of m_2 with respect to source node 0 of m_1.
Therefore, the k-subtrees are identical to the i-subtrees. The j-subtrees contain the possible destination
nodes of m_2 with respect to its source node. The j-subtrees depend on the particular choice
of the source node of m_2. For example, consider node 10 as the source of m_2. The (j = 0)-subtree
is {11}, the (j = 1)-subtree is {8, 9}, the (j = 2)-subtree is {12, 13, 14, 15}, the (j = 3)-subtree is
{0, ..., 7}, and so on. With respect to source node 0 of message m_1, node 10 is an element of the
(k = 3)-subtree {8, ..., 15}.
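
The subtree definitions can be reproduced with a short Python sketch. It assumes that processing nodes are numbered left to right as in Figure 1, so that the ν-subtree of a reference node is obtained by flipping bit ν of its number and letting the lower ν bits range freely; the function name `subtree` is hypothetical.

```python
def subtree(ref, v):
    """Return the v-subtree of node `ref`: all nodes at dilation 2*(v+1).

    Assumption: nodes are numbered left to right as in Figure 1, so the
    v-subtree is found by flipping bit v of `ref` and letting the lower
    v bits take all values.
    """
    base = ((ref >> v) ^ 1) << v
    return list(range(base, base + 2 ** v))

print([subtree(0, v) for v in range(4)])    # i-subtrees relative to node 0
print([subtree(10, v) for v in range(4)])   # j-subtrees relative to node 10
```

For reference node 0 this yields the sets {2^ν, ..., 2^{ν+1} − 1}, and for reference node 10 it yields the four j-subtrees listed in the example.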

We can compute Pr[C_2] by summing up the individual probabilities Pr[k, i, j], presuming that
Pr[k, i, j] accounts for the average probability of a collision for all source nodes of m_2 in the k-subtree,
all destination nodes of m_1 in the i-subtree, and all destination nodes of m_2 in the j-subtree. Since
for a fat tree with n processing nodes the largest subtree contains n/2 processing nodes, we obtain
for Pr[C_2]:

$$\Pr[C_2] = \frac{1}{(n-1)^3} \sum_{k=0}^{\lg n-1} 2^k \sum_{i=0}^{\lg n-1} \sum_{j=0}^{\lg n-1} 2^i \, 2^j \, \Pr[k,i,j] \tag{1}$$

A structural analysis based on a particular choice of k allows us to determine Pr[k, i, j] for
all i and j. This analysis results in the following matrix of probabilities for a particular k, for
0 ≤ i ≤ lg n − 1, and for 0 ≤ j ≤ lg n − 1:







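$$
\Pr[k,i,j] \;=\;
\begin{array}{r|ccccccc}
 & j=0 & \cdots & j=k-1 & j=k & j=k+1 & \cdots & j=\lg n-1 \\ \hline
i=0 & 0 & \cdots & 0 & \frac{2}{2^{k+1}} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & & \vdots \\
i=k-1 & 0 & \cdots & 0 & \frac{k+1}{2^{k+1}} & 0 & \cdots & 0 \\
i=k & \frac{2}{2^{k+1}} & \cdots & \frac{k+1}{2^{k+1}} & 0 & 0 & \cdots & 0 \\
i=k+1 & 0 & \cdots & 0 & 0 & \frac{k+2}{2^{k+2}} & & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \ddots & \\
i=\lg n-1 & 0 & \cdots & 0 & 0 & 0 & & \frac{k+2}{2^{\lg n}}
\end{array}
\tag{2}
$$

The generic entries are (j + 2)/2^{k+1} in row i = k for j < k, (i + 2)/2^{k+1} in column j = k for
i < k, and (k + 2)/2^{i+1} on the diagonal i = j for i > k. All other entries are 0.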
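
The matrix of Equation 2 can be written down as a small Python function, which is convenient for checking the sums that follow. This is a sketch; the function name `pr_kij` is hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction

def pr_kij(k, i, j):
    """Entry of the matrix in Equation 2 for given k, i, j (all >= 0)."""
    if i == k and j < k:            # row i = k
        return Fraction(j + 2, 2 ** (k + 1))
    if j == k and i < k:            # column j = k
        return Fraction(i + 2, 2 ** (k + 1))
    if i == j and i > k:            # main diagonal below row k
        return Fraction(k + 2, 2 ** (i + 1))
    return Fraction(0)              # all remaining cases cannot collide

# Matrix for k = 2 in a tree with n = 16 (indices 0..3).
for i in range(4):
    print([str(pr_kij(2, i, j)) for j in range(4)])
```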



We can derive Equation 2 by means of a case-by-case analysis depending on the relationship 
between i, j, and k. For the sake of clarity, we discuss each case in detail. Although this results in 
a lengthy analysis, it is nothing but a straightforward accounting of elementary events: 

i < k ∧ j < k: First, consider the example i = 2, k = 3, and j = 1. The destination subtree of m_1
is the (i = 2)-subtree {4, 5, 6, 7}. The source node of m_2 is in the (k = 3)-subtree {8, ..., 15}.
The destination subtree of m_2 is the (j = 1)-subtree with respect to the source node of m_2.
It can only be one of the four subtrees {8, 9}, {10, 11}, {12, 13}, or {14, 15}. For any choice of
the source of m_2, this subtree is contained in the k-subtree. Thus messages m_1 and m_2 travel
through disjoint subtrees, and cannot collide. In general, the destination subtree of m_1 is the
i-subtree {2^i, ..., 2^{i+1} − 1}. The source node of m_2 is in the k-subtree {2^k, ..., 2^{k+1} − 1},
which is disjoint from the i-subtree for i < k. Finally, for j < k, the j-subtree is a proper
subset of the k-subtree, and therefore disjoint from the i-subtree as well. As a consequence,
Pr[k, i, j] = 0 for i < k ∧ j < k.



i = k = j: Consider the case i = k = j = 2. The i-subtree and the k-subtree are {4, 5, 6, 7}.
The j-subtree is {0, 1, 2, 3}, independent of the particular source of m_2 within the k-subtree.
Message m_1 travels from node 0 to one of the nodes in the i-subtree, and m_2 from one of
the nodes in the k-subtree to one of the nodes in the j-subtree. Both messages traverse in
opposite directions through one of the L_2-routers in Figure 1, potentially the same router.
Since all links are bidirectional, the tree supports this criss-crossing message pattern without
collisions, even if the messages traverse the same router. We can easily generalize
this case and see that Pr[k, i, j] = 0 for all i = k = j.

i > k > j or i < k < j: Consider the example i = 3, k = 2, and j = 1. The i-subtree is {8, ..., 15}
and the k-subtree is {4, 5, 6, 7}. For any choice of the source of m_2 in the k-subtree, its
destination is in the (j = 1)-subtree, which must be either {4, 5} or {6, 7}, and is a proper
subset of the k-subtree. Thus, message m_2 is confined to the k-subtree, whereas m_1 travels
through the tree to the i-subtree without traversing any of the routers connecting the nodes
of the k-subtree. The fact that messages m_1 and m_2 never traverse the same router is easily
generalized for i > k > j. The case i < k < j is symmetric. Therefore, Pr[k, i, j] = 0 for
i > k > j and i < k < j.

i > j > k or j > i > k: Consider i = 3, j = 2, and k = 1. The i-subtree is {8, ..., 15} and
the k-subtree is {2, 3}. There exists exactly one j-subtree for all choices of the source of m_2
in the k-subtree, namely {4, 5, 6, 7}. The key observation here is that both
messages m_1 and m_2 travel upstream, partially in parallel, until they reach a router where m_1
travels further upstream towards the i-subtree whereas m_2 turns downstream towards the j-subtree.
In the example, this happens at one of the L_2-routers in Figure 1. Since collisions
cannot happen on the upstream paths of two messages, m_1 and m_2 do not collide. The case
where j > i > k is similar, except that m_2 travels further upstream than m_1. We find that
Pr[k, i, j] = 0 for all i > j > k and all j > i > k.

j < i = k or i < j = k: These cases correspond to the non-zero elements in row i = k and
column j = k of the matrix in Equation 2, respectively. We discuss the case j < i = k.
Case i < j = k holds by symmetry. The destination node of m_1 with source node 0 is in the
i-subtree {2^i, ..., 2^{i+1} − 1}. Since i = k, the k-subtree equals the i-subtree, and the source
node of m_2 is a node of the i-subtree. Without loss of generality, we consider node 2^i = 2^k to
be the source node of m_2. The destination of m_2 is in the j-subtree, which is a proper subset
of the k-subtree, because j < k. For example, for i = k = 2 and j = 1 both the i-subtree and
k-subtree are {4, 5, 6, 7}. If we pick the source of m_2 to be node 2^2 = 4, the (j = 1)-subtree
containing the potential destinations of m_2 is {6, 7}. In general, we find that the destination
of m_2 must be in the j-subtree {2^k + 2^j, ..., 2^k + 2^{j+1} − 1} if the source of m_2 is node 2^k.

Let us study the possible collision scenarios for messages m_1 and m_2 by means of the preceding
example. If m_1 has destination 4 or 5, m_2 may travel to destination 6 or 7 simultaneously
without collision. If both m_1 and m_2 have the same destination, which may be node 6 or
node 7, the messages will collide with probability 1. This collision may happen either at the
L_0-router connecting the destination node, or at one of the L_1-routers connecting subtrees
{4, 5} and {6, 7} if both messages attempt to traverse the same router. If the
destinations of m_1 and m_2 are different, say m_1 is destined for node 6 and m_2 for node 7,
a collision may occur at one of the L_1-routers connecting subtrees {4, 5} and {6, 7} if both
messages attempt to traverse the same router. If m_1 and m_2 travel through different L_1-routers,
they can travel collision-free through the L_0-router to their destinations.

We can generalize the observations from this example assuming that the source of m_2 is 2^k
and the destination of m_2 is 2^k + 2^j. We dissect the j-subtree {2^k + 2^j, ..., 2^k + 2^{j+1} − 1} into
r-subtrees {2^k + 2^j + 2^r, ..., 2^k + 2^j + 2^{r+1} − 1} for 0 ≤ r < j. For example, with i = k = 3 and
j = 2, the source of m_2 is node 8, the destination of m_2 is node 12, and the (j = 2)-subtree
is {12, 13, 14, 15}. Then, the (r = 0)-subtree is {13} and the (r = 1)-subtree is {14, 15}.

We observe that if the destination of m_1 is in the r-subtree, then m_1 and m_2 cannot collide at
any of the L_ν-routers for 0 ≤ ν ≤ r. Thus, collisions may occur only at routers at level r + 1
or higher in the j-subtree. We account for the collisions of m_1 and m_2 due to all routers at
level r + 1 and higher by counting all paths of m_1 and m_2 that reach a router node at level
r + 1 and traverse the same L_{r+1}-router. The message paths are determined randomly due
to the port selections on the upstream paths. Since the upstream paths of m_1 and m_2 are
disjoint for j < i = k, the random selections are independent. Due to this independence, and
since there are 2^{r+1} L_{r+1}-routers on the downstream paths of m_1 and m_2, the probability
that m_1 traverses a particular L_{r+1}-router is 1/2^{r+1}, and likewise for m_2. Thus, the probability
that the paths of both m_1 and m_2 traverse the same router of a router node at level r + 1 is
2^{r+1} · 1/2^{r+1} · 1/2^{r+1} = 1/2^{r+1}. Consequently, the probability of a collision of m_1 and m_2 at
a router node at level r + 1 or higher is 1/2^{r+1}.

To compute probability Pr[k, i, j], we sum up the probabilities of the independent collision
scenarios that may occur for m_1 and m_2. We fix the source of m_2 at 2^k and the destination
at 2^k + 2^j. By renumbering the nodes in the k-subtree, we find that the collision probability
for all possible sources and destinations of m_2 is equal to this particular choice. Therefore,
it is sufficient to account for all destinations of m_1 in the i-subtree with a fixed source and
destination of m_2 to compute the average collision probability. The collision probability is 1
if m_1 chooses the same destination 2^k + 2^j as m_2. This happens with probability 1/2^i since
there are 2^i possible destinations for m_1. If the destination of m_1 is outside of the j-subtree,
the collision probability is 0. Otherwise, the destination of m_1 lies in one of the r-subtrees,
each of which is a proper subset of the j-subtree. Message m_1 chooses one of the 2^r destination
nodes of a given r-subtree with probability 2^r/2^i, and each of them has the collision probability
1/2^{r+1} derived above. We need to sum up the probabilities over the disjoint r-subtrees for 0 ≤ r < j.
Therefore, we obtain:

$$\Pr[k,i,j] = \frac{1}{2^i} \cdot 1 + \sum_{r=0}^{j-1} \frac{2^r}{2^i} \cdot \frac{1}{2^{r+1}} = \frac{j+2}{2^{i+1}}.$$

Since we are considering the case i = k, we have Pr[k, i, j] = (j + 2)/2^{i+1} = (j + 2)/2^{k+1} for j < i = k,
yielding the elements in the matrix of Equation 2 for row i = k. The column elements for
i < j = k follow by symmetry.
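
The closed form for the row elements can be checked mechanically. The following Python sketch evaluates the sum exactly as described in the text (one term for the shared destination, one term per r-subtree) and compares it with (j + 2)/2^{i+1}; the function name `row_probability` is hypothetical.

```python
from fractions import Fraction

def row_probability(i, j):
    """Average collision probability for j < i = k, summed term by term."""
    same_destination = Fraction(1, 2 ** i) * 1
    via_r_subtrees = sum(
        Fraction(2 ** r, 2 ** i) * Fraction(1, 2 ** (r + 1)) for r in range(j)
    )
    return same_destination + via_r_subtrees

# The sum agrees with the closed form for all small cases.
for i in range(1, 8):
    for j in range(i):
        assert row_probability(i, j) == Fraction(j + 2, 2 ** (i + 1))

print(row_probability(3, 2))  # example from the text with i = k = 3, j = 2: prints 1/4
```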

i = j > k: This case corresponds to the non-zero elements on the main diagonal of the matrix in
Equation 2. The i-subtree is equal to the j-subtree {2^j, ..., 2^{j+1} − 1}. Similar to the previous
case, we fix the destination of m_2 at node 2^j, and dissect the j-subtree {2^j, ..., 2^{j+1} − 1} into
r-subtrees {2^j + 2^r, ..., 2^j + 2^{r+1} − 1} for 0 ≤ r < j. Now, the accounting of collisions depends
on k rather than j, because the source of m_2 determines the routers on the downstream paths
of m_1 and m_2 at which collisions can occur.

Let us consider an example first. Assume that i = j = 3, that is, the (i = j = 3)-subtree is
{8, ..., 15}, and the destination of m_2 is node 2^j = 8. For k = 0, the (k = 0)-subtree is {1}.
The only possible source of m_2 is node 1. Recall that we fix the source of m_1 to be node 0.
Inspection of the tree in Figure 1 reveals that m_1 and m_2 cannot collide except when the
destinations of m_1 and m_2 coincide at node 8, because the upstream-port selections are not
independent. Note that no collision occurs even for node 9 as the destination of m_1. We may
argue that the L_0-router connecting subtrees {0} and {1} guarantees that m_1 and m_2 travel
collision-free through the tree such that they arrive at different L_1-routers of the router node
connecting subtrees {8, 9} and {10, 11}. From there, m_1 and m_2 can travel through different
upstream ports of the L_0-router connecting nodes 8 and 9 to their destinations.

For k = 1, the k-subtree is {2, 3}. We may choose node 2 as the source of m_2. The destination
of m_2 remains fixed at node 8. If the destination of m_1 is node 9, a collision occurs if m_1
and m_2 arrive at the same L_1-router connecting subtrees {8, 9} and {10, 11}. Such a path
is possible due to the independent upstream-port selections of the L_0-routers at the source
nodes 0 and 2. If these L_0-routers select upstream ports such that m_1 and m_2 arrive at the
same L_1-router connecting the L_0-routers, both messages will arrive at the same L_1-router
connecting subtrees {8, 9} and {10, 11}, leading to a collision.

The situation is similar for k = 2. In this case, collisions may occur at the L_1- or L_2-routers on
the downstream paths of m_1 and m_2. The dissection of the destinations of m_1 into r-subtrees
restricts the number of message paths of m_1 such that there exists only one out of 2^{r+1}
upstream paths that causes a collision on the downstream path. For r = 0, the destination
of m_1 is node 9. There are 2^{0+1} = 2 possible upstream paths of m_1 depending on the path
selection of the L_0-router at source node 0 of m_1. One of the two upstream paths leads to a
collision at an L_1-router or L_2-router on the downstream path. The other upstream path is
collision-free. For r = 1, one out of 2^2 = 4 upstream paths leads to a collision at an L_2-router
on the downstream path.

In general, we find that collisions may occur for a particular k at router nodes at level r + 1 or
higher for 0 ≤ r < k. The probability that the messages collide can be derived by considering
their upstream paths. Due to the symmetry of the tree, a collision occurs on the downstream
path at level r + 1 or higher due to independent path selections of the upstream routers at
levels below r + 1. Assuming that the path of m_2 is fixed, there are 2^{r+1} possible upstream
paths for m_1, only one of which can lead to a collision on the downstream path. Since the
random upstream-path selections are independent below router level r + 1, the probability
that m_1 chooses the collision path is 1/2^{r+1}.

Using the same accounting argument for the paths of m_1 and m_2 as in case j < i = k, we
find that

$$\Pr[k,i,j] = \frac{1}{2^i} \cdot 1 + \sum_{r=0}^{k-1} \frac{2^r}{2^i} \cdot \frac{1}{2^{r+1}} = \frac{k+2}{2^{i+1}}.$$

This result coincides with the elements on the diagonal of the matrix in Equation 2 for
i = j > k.
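
The diagonal case can be replayed as a worked example by enumerating the destinations of m_1 one by one. The Python sketch below fixes the destination of m_2 at node 2^i, assigns each destination of m_1 the collision probability stated in the text (1 for the shared destination, 1/2^{r+1} inside an r-subtree with r < k, and 0 otherwise), and averages. The function name `diagonal_probability` is hypothetical.

```python
from fractions import Fraction

def diagonal_probability(k, i):
    """Average collision probability for i = j > k, with m_2 destined for node 2**i."""
    dest_m2 = 2 ** i
    total = Fraction(0)
    for dest_m1 in range(2 ** i, 2 ** (i + 1)):
        offset = dest_m1 - dest_m2
        if offset == 0:
            p = Fraction(1)                   # same destination: certain collision
        else:
            r = offset.bit_length() - 1       # dest_m1 lies in the r-subtree
            p = Fraction(1, 2 ** (r + 1)) if r < k else Fraction(0)
        total += p
    return total / 2 ** i                     # average over the 2**i destinations

# Example from the text: i = j = 3 with k = 0, 1, 2.
print([str(diagonal_probability(k, 3)) for k in range(3)])

for i in range(1, 8):
    for k in range(i):
        assert diagonal_probability(k, i) == Fraction(k + 2, 2 ** (i + 1))
```

For i = j = 3 and k = 2, destinations 8, 9, 10, and 11 contribute 1, 1/2, 1/4, and 1/4, and destinations 12 through 15 contribute 0, giving an average of 2/8 = (k + 2)/2^{i+1}.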

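We now turn to computing Pr[C_2] from Equation 1. The sums over i and j can be computed
as a function of k from Equation 2 by adding up the row elements for i = k and 0 ≤ j < k, the
column elements for j = k and 0 ≤ i < k, and the diagonal elements for i = j and k < i ≤ lg n − 1:

$$
\begin{aligned}
\sum_{i=0}^{\lg n-1} \sum_{j=0}^{\lg n-1} 2^i \, 2^j \, \Pr[k,i,j]
 &= \sum_{i=0}^{k-1} 2^i \, 2^k \, \frac{i+2}{2^{k+1}}
  + \sum_{j=0}^{k-1} 2^k \, 2^j \, \frac{j+2}{2^{k+1}}
  + \sum_{\nu=k+1}^{\lg n-1} 2^\nu \, 2^\nu \, \frac{k+2}{2^{\nu+1}} \\
 &= \sum_{i=0}^{k-1} (i+2) \cdot 2^i + \sum_{\nu=k+1}^{\lg n-1} \frac{k+2}{2} \cdot 2^\nu \\
 &= k \cdot 2^k + \frac{k+2}{2} \, (n - 2^{k+1}).
\end{aligned}
$$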
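
The simplification of the double sum can be verified numerically. The sketch below restates the matrix entries of Equation 2 as a hypothetical helper `pr_kij`, weights each entry by the subtree sizes 2^i and 2^j (an assumption about the exact form of the weights, consistent with the intermediate sums shown), and compares the result with the closed form k · 2^k + (k + 2)/2 · (n − 2^{k+1}).

```python
from fractions import Fraction

def pr_kij(k, i, j):
    """Entry of the matrix in Equation 2."""
    if i == k and j < k:
        return Fraction(j + 2, 2 ** (k + 1))
    if j == k and i < k:
        return Fraction(i + 2, 2 ** (k + 1))
    if i == j and i > k:
        return Fraction(k + 2, 2 ** (i + 1))
    return Fraction(0)

def weighted_sum(k, lg_n):
    """Sum of 2^i * 2^j * Pr[k, i, j] over all i and j (weights are an assumption)."""
    return sum(
        2 ** i * 2 ** j * pr_kij(k, i, j)
        for i in range(lg_n)
        for j in range(lg_n)
    )

def closed_form(k, lg_n):
    n = 2 ** lg_n
    return k * 2 ** k + Fraction(k + 2, 2) * (n - 2 ** (k + 1))

for lg_n in range(1, 11):
    for k in range(lg_n):
        assert weighted_sum(k, lg_n) == closed_form(k, lg_n)
print("closed form matches for n = 2 .. 1024")
```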
Finally, we compute the sum over k to obtain Pr[C_2]:

$$\Pr[C_2] = \frac{1}{(n-1)^3} \sum_{k=0}^{\lg n-1} 2^k \left( k \cdot 2^k + \frac{k+2}{2} \, (n - 2^{k+1}) \right) = \frac{3 n^2 \lg n - 4 (n^2 - 1)}{6 \, (n-1)^3}.$$

Proving the upper and lower bounds for Pr[C_2] is trivial and is left to the reader. ■
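
The bounds of Lemma 1 can be checked numerically under one explicit assumption: that the per-k sums derived above are weighted by the size 2^k of the k-subtree and normalized by the (n − 1)^3 elementary events of the sample space. The Python sketch below follows that assumption; the function name `pr_c2` is hypothetical, and the check covers powers of two from n = 4 to n = 2^20 only.

```python
from fractions import Fraction

def pr_c2(lg_n):
    """Average 2-message collision probability for n = 2**lg_n processing nodes.

    Assumption: each (k, i, j) combination is weighted by 2^k * 2^i * 2^j
    elementary events out of (n - 1)^3.
    """
    n = 2 ** lg_n
    total = sum(
        2 ** k * (k * 2 ** k + Fraction(k + 2, 2) * (n - 2 ** (k + 1)))
        for k in range(lg_n)
    )
    return total / (n - 1) ** 3

for lg_n in range(2, 21):
    n = 2 ** lg_n
    p = pr_c2(lg_n)
    assert Fraction(lg_n, 3 * n) <= p <= Fraction(lg_n, 2 * n)

print(float(pr_c2(4)), 4 / (3 * 16), 4 / (2 * 16))  # n = 16: value between the two bounds
```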

We can now reap the benefits from the simple yet laborious chore of proving Lemma 1, and 
apply Lemma 1 to the case where more than two messages enter the network. 

5 The Balls-and-Bins Model 

We now consider m > 2 messages. We assume that m distinct message sources are chosen randomly,
and that m potentially identical destinations are chosen independently and at random. Since each
source can transmit only one message at a time, n is an upper bound for m, and we have 2 < m ≤ n.
Lemma 1 enables us to model the collision behavior of m messages by means of a classical balls-and-bins
game. A message transmission corresponds to a ball that is tossed randomly and independently
of other tosses into a collision bin. Two messages collide if the corresponding balls land in
the same collision bin. The only piece of information that we supply to the balls-and-bins game is
the probability Pr[C_2] according to Lemma 1. The number of collision bins shall reflect this probability,
and is therefore chosen as follows.

Corollary 1 (Bin Calibration) The number of bins b of the balls-and-bins game modeling the
collision probability in the full-duplex fat tree is 2n/lg n.

Proof: We toss two balls independently and at random into b collision bins. The probability
that both balls land in the same bin is 1/b. This probability shall be equal to the average collision
probability of two messages Pr[C_2]. Consequently, choosing

$$b = \frac{2n}{\lg n} \le \frac{1}{\Pr[C_2]}$$

yields a conservative analysis, but does not affect our complexity result, because b differs from the
actual value by a small constant factor only. ■
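
As a small illustration of the calibration, the following Python sketch computes b = 2n/lg n and the resulting same-bin probability 1/b for two balls. The function name `num_bins` is hypothetical.

```python
from math import log2

def num_bins(n):
    """Number of collision bins b = 2n / lg n from Corollary 1."""
    return 2 * n / log2(n)

for n in (16, 1024, 2 ** 20):
    b = num_bins(n)
    # Two independent uniform tosses land in the same bin with probability 1/b,
    # which equals the upper bound lg n / (2n) on Pr[C_2].
    print(n, b, 1 / b)
```

For n = 16 the model uses b = 8 bins, so two balls share a bin with probability 1/8 = lg n/(2n).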

Figure 3: Simulations of Model I including a destination toss and Model II without the destination
toss for m = n/8 (left) and m = n (right). The numbers of delivery rounds due to the two models
differ by a small constant factor only.

Recall that Pr[C_2] is the average probability across all possible distinct sources and potentially
identical destinations. Therefore, when considering m > 2 messages, more than two balls may land
in a particular collision bin. All of the corresponding messages shall collide in the network. Our
key approximation of the collision behavior is that only one of these messages shall survive the
collisions. Hence, all but one ball in a collision bin correspond to rejected messages, and one ball
corresponds to a delivered message. We call the tossing of balls into collision bins a collision toss
and its equivalent with respect to message transmissions a delivery round. All messages rejected
during one delivery round are retried in a subsequent delivery round. The number of delivery
rounds needed to deliver all messages determines the performance of the network.
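
The delivery-round model described above can be sketched in a few lines of Python. This is an illustrative simulation of the balls-and-bins game only, not the authors' simulator; the function name `model2_rounds` and the choice of random number generator are assumptions.

```python
import random

def model2_rounds(m, b, rng):
    """Count delivery rounds until all m messages are delivered.

    Each round tosses the pending balls into b bins. One ball per
    nonempty bin is delivered; all other balls are retried.
    """
    rounds = 0
    pending = m
    while pending > 0:
        occupied = {rng.randrange(b) for _ in range(pending)}  # one collision toss
        pending -= len(occupied)     # one delivered message per nonempty bin
        rounds += 1
    return rounds

rng = random.Random(1)
n = 1024
b = 2 * n // 10                      # b = 2n / lg n with lg n = 10, rounded down
print(model2_rounds(n, b, rng))
```

Because at least one bin is nonempty whenever messages are pending, every round delivers at least one message and the loop terminates after at most m rounds.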

The model of the collision behavior by means of the balls-and-bins game described above deserves
further discussion. In fact, this model may appear to be unacceptably crude, because it
ignores a variety of dependencies, most notably those dependencies imposed by the distribution 
of message destinations. We argue, however, that we may neglect these dependencies safely. We 
provide empirical evidence in Section 7 that the balls-and-bins model reflects reality at the level of 
end-to-end performance with sufficient accuracy, indeed. 

As an aside, let us show that the dependencies due to the distribution of message destinations
affect the number of delivery rounds by a constant factor only compared to neglecting them. To
account for the distribution of message destinations, we may construct a model of two balls-and-bins
games. The first game consists of a single toss, the destination toss, of m balls into n destination
bins representing the random choice of message destinations. The second game consists of repeated
collision tosses into 1/Pr[C_2] collision bins representing delivery rounds. During the collision
game we may toss all of the balls representing a single destination into the same collision bin.
This construction expresses the fact that if there were no messages other than those with
the same destination, these messages would surely collide with each other. The destination bin with
the maximum number of balls constitutes a critical path across the delivery rounds. For m = n
messages, the critical-path length is O(log n / log log n) with high probability. Note that, like our
original model, this more realistic model is merely an approximation as well, because it treats the
collision behavior by means of the average collision probability Pr[C_2].
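
The two-game construction can be sketched as follows. The Python code is a minimal interpretation under stated assumptions: in every round each destination with pending messages picks one collision bin for all of its balls, each nonempty collision bin delivers exactly one message, and the surviving message is drawn uniformly among the destination groups in that bin. The last rule and the function name `model1_rounds` are not taken from the text.

```python
import random
from collections import Counter

def model1_rounds(destinations, b, rng):
    """Count delivery rounds given the outcome of the destination toss.

    destinations: one destination bin per message.
    b: number of collision bins.
    """
    pending = Counter(destinations)          # messages still waiting per destination
    rounds = 0
    while pending:
        bins = {}
        for dest in pending:                 # all balls of one destination share a bin
            bins.setdefault(rng.randrange(b), []).append(dest)
        for contenders in bins.values():
            winner = rng.choice(contenders)  # assumption: uniform choice among groups
            pending[winner] -= 1
            if pending[winner] == 0:
                del pending[winner]
        rounds += 1
    return rounds

rng = random.Random(7)
n = m = 256
dests = [rng.randrange(n) for _ in range(m)]  # the destination toss
print(model1_rounds(dests, 2 * n // 8, rng))  # b = 2n / lg n with lg n = 8
```

The destination with the most messages needs at least that many rounds, which is the critical path mentioned above.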

Let us call the model with the destination toss Model I and our original, simpler balls-and-bins 
game Model II. Simulations show that the number of delivery rounds due to these two models 
differ by a constant factor only. Figure 3 shows the number of delivery rounds for 2^4 ≤ n ≤ 2^20
processing nodes and for the number of messages m = n/8 and m = n. Both graphs show the
minimum, average, and maximum number of delivery rounds as error bars. In addition, we show
the ratio of the average number of delivery rounds for Model I and Model II, the ratio of the 
maximum number of delivery rounds, as well as their mean values over n. The mean values are 
horizontal lines and are consequently independent of n. Since the data points deviate only slightly 
from the mean values, we conclude that the number of delivery rounds due to Model I and Model II 
differ by a small constant factor only. 

Using the potential method [2], we can construct a proof that considers the dependencies of 
the destination distribution expressed by Model I. This proof yields the claimed result that the 
number of delivery rounds is bounded by $O(\lg m)$. Although the potential method is an elegant proof
technique, we feel that using Model II results in an even simpler, straightforward proof, and it 
exposes the inherent problem structure clearly. In the following, we are therefore concerned with 
the analysis of Model II only. 

We now turn our attention to results from basic probability theory about the balls-and-bins
game underlying Model II. A delivery round corresponds to tossing $m$ balls into $b$ collision bins. We
calculate the number of delivered and rejected messages as follows. After tossing $m$ balls, there will
be $b_e$ empty bins and $b_n = b - b_e$ nonempty bins. The number of delivered messages corresponds
to the number of nonempty bins $b_n$, because each nonempty bin contains at least one ball. The
number of rejected messages corresponds to the number of balls in the nonempty bins minus the one
ball per nonempty bin that corresponds to a delivered message. Hence, the number of rejected
messages is $m - b_n$.
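A single delivery round of this game can be sketched in a few lines of Python. The function name `collision_toss` and the seeded generator are illustrative assumptions, not part of the original simulator; the sketch only encodes the counting rule stated above (one delivered message per nonempty bin, all other balls rejected).

```python
import random


def collision_toss(m, b, rng):
    """Toss m balls into b collision bins; return (delivered, rejected)."""
    nonempty = set()
    for _ in range(m):
        nonempty.add(rng.randrange(b))  # each ball picks a bin uniformly
    delivered = len(nonempty)           # b_n: one message per nonempty bin
    rejected = m - delivered            # m - b_n: all remaining balls
    return delivered, rejected


if __name__ == "__main__":
    rng = random.Random(42)
    print(collision_toss(100, 50, rng))
```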

The expected number of empty bins in the balls-and-bins game can be calculated as follows.
The probability that a bin remains empty after tossing $m$ balls is

\[ \left(1 - \frac{1}{b}\right)^m \le e^{-m/b}, \]

since $1 + x \le e^x$ for all $x$. Let $X_i$ be an indicator variable with value 1 if bin $i$ is empty and with
value 0 otherwise. Then, $E[X_i] = (1 - 1/b)^m$. By linearity of expectation, the expected number of
empty bins is

\[ E[b_e] = \sum_{i=1}^{b} E[X_i] = b \left(1 - \frac{1}{b}\right)^m \le b\, e^{-m/b}. \tag{3} \]

By linearity of expectation, the expected number of delivered messages is

\[ E[D] = b - E[b_e] = b \left(1 - \left(1 - \frac{1}{b}\right)^m\right) \ge b \left(1 - e^{-m/b}\right), \tag{4} \]

and the expected number of rejected messages is

\[ E[R] = m - E[D] \le m - b \left(1 - e^{-m/b}\right). \tag{5} \]

The rejected messages of one delivery round are subject to retry in the subsequent delivery round.
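The exact expectations and their exponential bounds in Equations 3 to 5 are easy to compare numerically. The sketch below is only an illustration; the function names `expected_counts` and `exponential_bounds` are hypothetical.

```python
import math


def expected_counts(m, b):
    """Exact expectations (empty bins, delivered, rejected) for one toss."""
    e_empty = b * (1.0 - 1.0 / b) ** m
    e_delivered = b - e_empty
    e_rejected = m - e_delivered
    return e_empty, e_delivered, e_rejected


def exponential_bounds(m, b):
    """Bounds of Equations 3, 4 and 5: upper, lower, upper, respectively."""
    x = math.exp(-m / b)
    return b * x, b * (1.0 - x), m - b * (1.0 - x)


if __name__ == "__main__":
    print(expected_counts(200, 200))
    print(exponential_bounds(200, 200))
```

For $m = b$ the two sets of values are already close, which indicates that little is lost by working with the exponential form.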

6 Proof of Time Bound 

We model the fat-tree network with $n$ processing nodes and the architecture and collision behavior
described in Section 3 by means of Model II developed in Section 5. Each of the $m \le n$ messages
corresponds to a ball. The collision bins are used to capture the collision behavior of the network.
We will establish the following statement:

Theorem 1 (Bound of Delivery Rounds) In balls-and-bins model II of a fat-tree network with
$n$ processing nodes, $m \le n$ messages with randomly chosen, distinct sources and independently and
randomly chosen destinations are delivered within $O(\lg m + \ln(1/\varepsilon))$ delivery rounds with probability
$1 - \varepsilon$ for any $\varepsilon > 0$.

Our proof consists of two phases. In phase I we prove that the number of messages $m > 2n/\lg n$
is reduced to $2n/\lg n$ messages within $O(\lg m + \ln(1/\varepsilon))$ rounds with probability $1 - \varepsilon$. In phase II
we prove that $m \le 2n/\lg n$ messages are delivered within $O(\lg m + \ln(1/\varepsilon))$ rounds with probability
$1 - \varepsilon$. Together, phases I and II yield the claimed bound.

To facilitate the analysis, we assume that the retry strategy of the network interfaces of the 
processing nodes is such that the delivery rounds do not overlap. Thus, all network interfaces 
wait until all messages transmitted at the beginning of one delivery round are either delivered or 
rejected. 

6.1 Analysis of Phase I 

The analysis of phase I for $m > 2n/\lg n$ messages is based on the observation that tossing
$m > 2n/\lg n$ balls into $b = 2n/\lg n$ collision bins will inevitably result in more than one ball landing
in one or more bins. For m < n, the number of balls landing in each bin will be relatively large, 
corresponding to a large number of collisions during that round. 

The distribution of $m$ balls over $b$ bins follows the binomial distribution. A well-known result
that we may apply to this distribution is the Markov Inequality [11]. It states that for a
non-negative random variable $X$ and any positive real $t$ we have

\[ \Pr[X \ge t] \le \frac{E[X]}{t}. \]

Since the number of empty bins $b_e$ after a collision toss is a non-negative random variable, we may
apply the Markov Inequality to $b_e$. During phase I, we have $m \ge 2n/\lg n = b$. Thus, according to
Equation 3, the expected number of empty bins when tossing $m$ balls is

\[ E[b_e] \le b\, e^{-m/b} \le \frac{b}{e} \quad \text{for } m \ge b, \]

where $e$ denotes Euler's number. We choose $t = 2b/e$, and obtain from the Markov Inequality

\[ \Pr[b_e \ge 2b/e] \le \frac{E[b_e]}{2b/e} \le \frac{1}{2}. \]

Equivalently, the probability that the number of empty bins $b_e$ is less than $2b/e$ is at least 1/2.
We define a delivery round to be a successful delivery round if fewer than $2b/e$ collision
bins remain empty. Correspondingly, more than $b - 2b/e = b(1 - 2/e)$ messages are delivered in
a successful round. Amongst all delivery rounds, a successful delivery round occurs with
probability at least 1/2 according to the Markov Inequality. By definition, for each delivery round of phase I
we have $m \ge b$. Therefore, in each successful delivery round of phase I, at
least $b(1 - 2/e)$ messages are delivered, a number that does not depend on $m$. Considering successful
delivery rounds only, phase I ends after $S$ successful delivery rounds when the number of remaining
messages is reduced to $b$ messages. Therefore, phase I is subject to the boundary condition:

\[ m - S \cdot b \left(1 - \frac{2}{e}\right) = b. \]
Solving for $S$, we obtain:

\begin{align*}
S &= \frac{1}{1 - 2/e} \left(\frac{m}{b} - 1\right)
   = \frac{1}{1 - 2/e} \left(\frac{m \lg n}{2n} - 1\right) \tag{6} \\
  &\le \frac{1}{1 - 2/e} \left(\frac{1}{2} \lg m - 1\right) \tag{7} \\
  &= O(\lg m),
\end{align*}

since $m/\lg m \le n/\lg n$ for $2 \le m \le n$. Thus, we have established that the number of messages
$m > 2n/\lg n$ is reduced to $2n/\lg n$ messages within $O(\lg m)$ successful delivery rounds during
phase I.
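The number of successful rounds in Equation 6 and its logarithmic bound can be evaluated directly. The following Python sketch assumes that $n$ is a power of two, that $b = 2n/\lg n$ is used without rounding, and that the upper bound takes the form $(\frac{1}{2}\lg m - 1)/(1 - 2/e)$, which follows from $m/\lg m \le n/\lg n$; the function names are hypothetical.

```python
import math


def phase1_successful_rounds(m, n):
    """S from Equation 6: successful rounds needed to reduce m to b messages."""
    b = 2 * n / math.log2(n)
    if m <= b:
        return 0.0  # phase I is empty, phase II applies immediately
    return (m / b - 1) / (1 - 2 / math.e)


def phase1_log_bound(m):
    """Upper bound on S that depends on m only (assumed form, see lead-in)."""
    return (0.5 * math.log2(m) - 1) / (1 - 2 / math.e)


if __name__ == "__main__":
    n = 1 << 10
    for m in (256, 512, 1024):
        print(m, phase1_successful_rounds(m, n), phase1_log_bound(m))
```

The two values coincide for $m = n$ and drift apart for smaller $m$, which is where the bound becomes conservative.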

It remains to be shown that the number of ordinary delivery rounds $R$ containing $S$ successful
delivery rounds is of order $O(\lg m)$ with high probability. To that end we may assume that delivery
rounds are independent Bernoulli trials, and apply another well-known result, the Chernoff
Inequality [11]: For a random variable $X$ counting the successes in $n$ Bernoulli trials with
probability $p$ of success, $\mu = E[X] = np$, and $0 < \delta \le 1$, we have

\[ \Pr[X < (1 - \delta)\mu] < e^{-\mu \delta^2 / 2}. \]

For convenience we use the symbol $p_s = 1/2$ to denote the lower bound for the probability of the
occurrence of a successful delivery round. We assume that the number of ordinary delivery rounds $R$
is

\begin{align*}
R &= \frac{1}{p_s} (2S - 4 \ln \varepsilon) \\
  &\le 2 \left(\frac{2}{1 - 2/e} \left(\frac{1}{2} \lg m - 1\right) - 4 \ln \varepsilon\right) \\
  &= O(\lg m - \ln \varepsilon)
\end{align*}

for a small value $\varepsilon$. This seemingly ad hoc choice of $R$ is justified below by the fact that the
probabilities following from the Chernoff bound yield the desired result. The expected number of
successful delivery rounds within $R$ rounds is at least

\[ \mu_s = R \cdot p_s = 2S - 4 \ln \varepsilon, \]

because a successful round occurs with probability at least $p_s$. We use a slight modification of the
Chernoff bound, obtained by setting $\delta = a p_s / \mu_s$,

\[ \Pr[X < \mu_s - a p_s] < e^{-\frac{(a p_s)^2}{2 \mu_s}}, \tag{8} \]

where $0 < a p_s \le \mu_s$, and choose

\[ a = \frac{1}{p_s} (S - 4 \ln \varepsilon). \]

Note that the condition $a p_s \le \mu_s$ holds for any $\varepsilon$, since
$a p_s = S - 4 \ln \varepsilon < 2S - 4 \ln \varepsilon = \mu_s \iff S < 2S$.
We can express $\mu_s$ as a function of $a$ as follows:

\[ \mu_s = 2 a p_s + 4 \ln \varepsilon. \]



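Now, let $X_s$ be a random variable denoting the number of successful delivery rounds. We apply
the Chernoff bound of Equation 8 to $X_s$ and obtain:

\begin{align*}
& \Pr[X_s < \mu_s - a p_s] < e^{-\frac{(a p_s)^2}{2 \mu_s}} \\
\iff\ & \Pr[X_s < 2S - 4 \ln \varepsilon - S + 4 \ln \varepsilon] < e^{-\frac{(a p_s)^2}{4 a p_s + 8 \ln \varepsilon}} \\
\implies\ & \Pr[X_s < S] < e^{-\frac{a p_s}{4}} = e^{-S/4 + \ln \varepsilon} \le \varepsilon \\
\iff\ & \Pr[X_s \ge S] > 1 - \varepsilon.
\end{align*}

The implication uses $\ln \varepsilon < 0$ for $\varepsilon < 1$, so that the denominator
$4 a p_s + 8 \ln \varepsilon$ is smaller than $4 a p_s$.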



Hence, the probability that the number of successful delivery rounds $X_s$ within $R = O(\lg m + \ln(1/\varepsilon))$
delivery rounds reaches the required number of successful delivery rounds $S$ is greater than
$1 - \varepsilon$. Therefore, the number of delivery rounds needed to reduce $m > 2n/\lg n$ messages to
$2n/\lg n$ remaining messages is $O(\lg m + \ln(1/\varepsilon))$ with probability at least $1 - \varepsilon$ for any $\varepsilon > 0$. We
have consequently established the proof for phase I. ■
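The choice of $R$ and $a$ can be checked numerically: for any required number $S$ of successful rounds and any failure probability below 1, the right-hand side of Equation 8 should fall below that failure probability. The Python sketch below is an illustration only; the function names are hypothetical, and `eps` stands for the failure probability of Theorem 1.

```python
import math


def ordinary_rounds(S, eps, p_s):
    """R = (2S - 4 ln eps) / p_s, the number of ordinary rounds assumed above."""
    return (2 * S - 4 * math.log(eps)) / p_s


def chernoff_failure_bound(S, eps):
    """Right-hand side of Equation 8 for a*p_s = S - 4 ln eps, mu_s = 2S - 4 ln eps."""
    a_ps = S - 4 * math.log(eps)
    mu_s = 2 * S - 4 * math.log(eps)
    return math.exp(-(a_ps ** 2) / (2 * mu_s))


if __name__ == "__main__":
    S, eps = 15.0, 0.01
    print(ordinary_rounds(S, eps, p_s=0.5))
    print(chernoff_failure_bound(S, eps), "<", eps)
```

Note that the failure bound does not depend on $p_s$; the success probability only scales the number of ordinary rounds $R$.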



6.2 Analysis of Phase II 

During phase II we inject $m \le 2n/\lg n$ messages into the network. Correspondingly, in our
balls-and-bins model, we toss $m \le 2n/\lg n$ balls into $b = 2n/\lg n$ collision bins. Since the number of
balls is less than or equal to the number of bins, we can expect to make progress by delivering a
constant fraction of the messages in each delivery round. In contrast, we have shown in Section 6.1
that a constant number of messages is delivered per successful delivery round in phase I.

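We apply the Markov Inequality to the number of rejected messages in a delivery round as
follows. According to Equation 5, the expected number of rejected messages is
$E[R] = m - E[D] \le m - b(1 - e^{-m/b})$. Choosing $t = (1 - \alpha)m$ with $0 < \alpha < 1$, we express that the
fraction $(1 - \alpha)$ of the $m$ balls tossed during the delivery round corresponds to rejected messages.
Applying the Markov Inequality to the number of rejected messages $R$, we obtain for $2 \le m \le b$:

\begin{align*}
\Pr[R \ge (1 - \alpha)m] &\le \frac{E[R]}{(1 - \alpha)m} \\
  &\le \frac{1}{(1 - \alpha)m} \left(m - b \left(1 - e^{-m/b}\right)\right) \\
  &= \frac{1}{1 - \alpha} \left(1 - \frac{b}{m} \left(1 - e^{-m/b}\right)\right) \\
  &\le \frac{1}{1 - \alpha} \left(1 - \left(1 - e^{-1}\right)\right) \\
  &= \frac{1}{(1 - \alpha)e}.
\end{align*}

Note that $f(x) = x(1 - e^{-1/x}) \ge 1 - e^{-1}$ for $x \ge 1$, because $df/dx$ decreases monotonically towards
0 for $x \to \infty$ and $df/dx(1) = 1 - 2/e > 0$. Hence $f$ is increasing for $x \ge 1$ with
$f(1) = 1 - e^{-1}$, and we apply this fact with $x = b/m$. To be meaningful, the probability
$1/((1 - \alpha)e)$ must be less than 1, providing us with the condition $\alpha < 1 - e^{-1}$.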
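The chain of inequalities above can be spot-checked numerically for $m \le b$. In the following Python sketch the function names are hypothetical and `alpha` denotes the fraction parameter of the Markov argument; the sketch compares the bound obtained from Equation 5 with the uniform bound that no longer depends on $m$ and $b$.

```python
import math


def markov_reject_bound(m, b, alpha):
    """Markov bound on Pr[R >= (1 - alpha) m] using the estimate of Equation 5."""
    expected_rejected_upper = m - b * (1 - math.exp(-m / b))
    return expected_rejected_upper / ((1 - alpha) * m)


def uniform_reject_bound(alpha):
    """The m-independent bound 1 / ((1 - alpha) e), valid for m <= b."""
    return 1 / ((1 - alpha) * math.e)


if __name__ == "__main__":
    b, alpha = 200, 0.5
    for m in (2, 50, 200):
        print(m, markov_reject_bound(m, b, alpha), uniform_reject_bound(alpha))
```

The gap between the two bounds is largest for small $m$ and closes at $m = b$, where both evaluate to $1/((1 - \alpha)e)$.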
Since the number of delivered messages is $D = m - R$, we have:

\[ R \ge (1 - \alpha)m \iff D \le m - (1 - \alpha)m = \alpha m. \]

We substitute this term in the Markov Inequality to obtain:

\begin{align*}
& \Pr[R \ge (1 - \alpha)m] \le \frac{1}{(1 - \alpha)e} \\
\iff\ & \Pr[D \le \alpha m] \le \frac{1}{(1 - \alpha)e} \\
\iff\ & \Pr[D > \alpha m] \ge 1 - \frac{1}{(1 - \alpha)e}.
\end{align*}

We have therefore established that at least a constant fraction $\alpha$ of $m$ messages is delivered with
probability at least $1 - 1/((1 - \alpha)e)$ within a single delivery round. We define a successful delivery
round for phase II to be a delivery round in which at least $\alpha m$ messages are delivered. A successful
delivery round occurs with probability at least $1 - 1/((1 - \alpha)e)$ amongst the delivery rounds.

Considering successful delivery rounds only, we know that at most $(1 - \alpha)m$ messages are
rejected and must be retried in the subsequent round. Hence, after $k$ successful delivery rounds, at
most $(1 - \alpha)^k m$ messages remain to be delivered. Since the last remaining message will be delivered
without any collisions, we have the boundary condition:

\[ (1 - \alpha)^S m = 1. \]

Choosing $\alpha = 1/2$, we obtain the number of successful delivery rounds

\[ S = \lg m \]

and the probability for the occurrence of a successful delivery round is at least $1 - 2/e$.
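The boundary condition can be read as a simple loop: each successful round leaves at most a fraction of the messages behind. The Python sketch below uses hypothetical function names, with `alpha` denoting the delivered fraction; it counts the successful rounds until at most one message remains and evaluates the corresponding success probability.

```python
import math


def successful_rounds(m, alpha):
    """Smallest S with (1 - alpha)^S * m <= 1, found by repeated reduction."""
    remaining, rounds = float(m), 0
    while remaining > 1:
        remaining *= (1 - alpha)  # at most this many messages are retried
        rounds += 1
    return rounds


def success_probability(alpha):
    """Lower bound 1 - 1/((1 - alpha) e) on the probability of a successful round."""
    return 1 - 1 / ((1 - alpha) * math.e)


if __name__ == "__main__":
    print(successful_rounds(1024, 0.5), success_probability(0.5))
```

A larger fraction would reduce the number of successful rounds but also lowers the probability bound, which reaches zero at $1 - e^{-1}$.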

It remains to be shown that the number of ordinary delivery rounds $R$ is of order $\lg m$ with high
probability. Analogous to phase I, we construct a Chernoff bound argument. With the probability
for a successful delivery round $p_s = 1 - 2/e$, we assume that the number of delivery rounds is

\[ R = \frac{1}{p_s} (2 \lg m - 4 \ln \varepsilon). \]

The expected number of successful delivery rounds is then at least

\[ \mu_s = R \cdot p_s = 2 \lg m - 4 \ln \varepsilon. \]

We choose

\[ a = \frac{1}{p_s} (\lg m - 4 \ln \varepsilon) \]

and apply the Chernoff bound to random variable $X_s$, which denotes the number of successful
delivery rounds:

\begin{align*}
& \Pr[X_s < \mu_s - a p_s] < e^{-\frac{(a p_s)^2}{2 \mu_s}} \\
\iff\ & \Pr[X_s < 2 \lg m - 4 \ln \varepsilon - \lg m + 4 \ln \varepsilon] < e^{-\frac{(a p_s)^2}{4 a p_s + 8 \ln \varepsilon}} \\
\implies\ & \Pr[X_s < \lg m] < e^{-\frac{a p_s}{4}} = e^{-\frac{\lg m}{4} + \ln \varepsilon} \le \varepsilon \\
\iff\ & \Pr[X_s \ge \lg m] > 1 - \varepsilon.
\end{align*}
Figure 4: Comparison of balls-and-bins model II with round-based fat-tree simulations for $m = n/8$
and $m = n$. These graphs are representative for other values of $0 < m \le n$. Model II differs from
the round-based fat-tree simulations by a small constant factor only. The normalized performance
of a real fat tree with immediate retry is shown in clock cycles with respect to the transmission
time of one message across the diameter of an $n$-node fat tree.

We have established that the number of successful delivery rounds $X_s$ within $R = O(\lg m + \ln(1/\varepsilon))$
delivery rounds is at least $\lg m$ with probability at least $1 - \varepsilon$. Because it takes at most $\lg m$
successful delivery rounds to deliver $m$ messages, the number of delivery rounds needed to deliver
$m \le 2n/\lg n$ messages is $O(\lg m + \ln(1/\varepsilon))$ with probability at least $1 - \varepsilon$ for any $\varepsilon > 0$. This
argument completes the proof for phase II. ■
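The complete Model II process, repeated collision tosses with retry of all rejected messages, can be sketched as a short loop. This is a minimal illustration and not the simulator behind the figures; the function name `delivery_rounds`, the rounding of $b = 2n/\lg n$ to an integer, and the seeded generator are assumptions.

```python
import math
import random


def delivery_rounds(m, n, rng):
    """Count rounds until all m messages are delivered in balls-and-bins Model II.

    Assumes n >= 2 and uses b = round(2n / lg n) collision bins in every round.
    """
    b = max(1, round(2 * n / math.log2(n)))
    rounds = 0
    while m > 0:
        nonempty = {rng.randrange(b) for _ in range(m)}
        m -= len(nonempty)  # one message is delivered per nonempty bin
        rounds += 1
    return rounds


if __name__ == "__main__":
    rng = random.Random(1)
    n = 1 << 10
    for m in (n // 8, n):
        print(m, delivery_rounds(m, n, rng), math.log2(m))
```

Since every round with at least one message delivers at least one message, the loop terminates after at most $m$ rounds.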



7 Discussion of Result 

To bound the number of delivery rounds in a fat-tree network, we have resorted to a proof methodology
in which we developed an approximating model of the collision behavior of messages that is
amenable to rigorous probabilistic analysis. Our balls-and-bins model is not powerful enough to
derive statements about the micro behavior of the network, for example about the number of collisions
at a particular router. However, we may claim the validity of our proof if we can show that
our model reflects reality at the level of delivery rounds. To that end, we provide empirical evidence
that the simple balls-and-bins model II does capture the collision behavior with sufficient accuracy.

Figure 4 compares three data sets of simulation results for the number of messages $m = n/8$
on the left-hand side and for $m = n$ on the right-hand side. These graphs are representative for a
large number of values of $m$ that we have simulated. Comparison of the number of delivery rounds
according to Model II with those for round-based fat-tree simulations demonstrates the validity
of Model II. For the maximum number of messages $m = n$ that can be in transit during a single
round, Model II shows the largest deviation from the fat-tree simulations. However, the numbers
of delivery rounds predicted by Model II and the round-based fat-tree simulation differ by a small
constant factor only, analogous to our observation in Figure 3.

In addition to the round-based simulation results, we show the normalized number of clock cycles
for delivering $m$ messages on a fat tree in Figure 4. These results represent the true performance
of our fat-tree design under the assumption that the transmission of $m = n/8$ or $m = n$ messages
starts at the same clock cycle, and that the network interfaces initiate a retransmission of a rejected
message one clock cycle after sensing a collision. Thus, these simulations drop the simplifying
assumption that retransmission occurs in rounds on all network interfaces simultaneously. For a
direct comparison with the round-based simulations, we normalize the number of clock cycles with
respect to the transmission time of a single message across the diameter of an $n$-node fat tree
measured in clock cycles. We conclude from these results that our $O(\lg m)$ bound holds for the
scenario with immediate retry as well. Our round-based model and simulations are conservative
by a constant factor of about two on average. We report that immediate retry delivers the highest
performance on our fat-tree architecture compared to other retry strategies, including exponential
back-off.

Figure 5: Comparison of balls-and-bins model II with round-based fat-tree simulations for $n = 64$
and $n = 2^{20}$. These graphs are representative for other values of $n \le 2^{20}$.

We have developed the fat-tree simulator as a behavioral model of a gate-level implementation
in order to scale up to $2^{20}$ processing nodes. Our router design has a latency of two clock cycles for
an advancing message, which includes the path reservation, and one clock cycle for a collision and 
acknowledgment signal to release and traverse the path in the opposite direction, respectively. We 
have implemented various retry strategies in our network interfaces, including round-based retry, 
where all messages transmitted during one delivery round are either delivered or rejected before the 
rejected messages are retransmitted in the subsequent delivery round. This retry strategy requires a 
global synchronization capability, and is not expected to be implemented in real systems. However, 
it allows for a direct comparison with the simulation results of the balls-and-bins model. 

Whereas Figure 4 shows the number of delivery rounds as a function of $n$ for a fixed ratio $m/n$, Figure 5
provides a view on the number of delivery rounds as a function of $m$ for fixed $n$. Like the graphs
in Figure 4, these graphs are representative for a large number of experiments for different values 
of n. To avoid clutter, we omit the normalized transmission times for the immediate retry strategy. 
The graphs in Figure 5 exhibit a number of behavioral details of the fat-tree network that deserve 
further discussion: 

1. The balls-and-bins model matches the number of delivery rounds due to the fat-tree simulation
accurately, in accordance with the results presented in Figure 4.

2. The error bars in the plots show the variation of the number of delivery rounds due to the
randomized routing strategy in the fat tree. The fact that the variation is relatively small
supports our high-probability result.

3. Our bound $O(\lg m) = c_1 \cdot \lg m + c_2$ appears as a straight line in the semi-logarithmic plots of
Figure 5 for $c_1 = 1$ and $c_2 = 0$. At first glance, even this bound seems to be conservative,
although it is significantly tighter than $O(\lg n)$.

4. The vertical lines in the plots of Figure 5 represent the boundary between phase I and phase II
of our proof at $m = 2n/\lg n$.

5. Recall that the number of delivery rounds in phase II is $O(\lg m)$ for $m \le 2n/\lg n$. Indeed, we
observe this behavior to the left of the vertical line at $m = 2n/\lg n$. The constant factor $c_1$
in $O(\lg m) = c_1 \lg m + c_2$ is evidently much smaller than 1, as a comparison with the straight
line for $\lg m$ reveals.

6. During phase I of our proof for $m > 2n/\lg n$, we made use of the inequality $m/\lg m \le n/\lg n$
for $2 \le m \le n$ to bound the number of successful delivery rounds in Equations 6 and 7 of
Section 6.1. In fact, both Model II and the fat-tree simulations exhibit the behavior of the
tighter bound $m \lg n / n \le \lg m$ for the number of delivery rounds, as we observe to the right
of the vertical line at $m = 2n/\lg n$.

7. Figure 5 includes an ad hoc curve fit of the number of delivery rounds as a superposition of the
models for phase I and phase II. For phase I, we use $m \lg n / 2n$, and $\lg m / 10 + 1$ for phase II.
The sum of both phases yields the curve displayed in Figure 5: $\lg m / 10 + m \lg n / 2n + 1$.
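The ad hoc curve fit of item 7 is simple enough to evaluate by hand or in code. The Python sketch below only restates the fitted expression; the function name `fitted_rounds` is hypothetical, and the comparison against $\lg m$ mirrors the straight line of item 3.

```python
import math


def fitted_rounds(m, n):
    """Curve fit lg m / 10 + m lg n / (2n) + 1 for the number of delivery rounds."""
    phase2_term = math.log2(m) / 10 + 1        # fit for phase II
    phase1_term = m * math.log2(n) / (2 * n)   # fit for phase I
    return phase1_term + phase2_term


if __name__ == "__main__":
    n = 1 << 20
    for m in (1 << 4, 1 << 10, 1 << 16, 1 << 20):
        print(m, fitted_rounds(m, n), math.log2(m))
```

For small $m$ the phase I term is negligible and the fit is dominated by $\lg m / 10 + 1$; the linear term takes over as $m$ approaches $n$.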

The simulation results in Figure 5 suggest that $O(\lg m)$ is in fact the optimal bound for phase II.
For phase I, $O(\lg m)$ is an upper bound of the observed behavior, which follows the tighter bound
$O(m \lg n / n)$. Consequently, the simulation results provide experimental evidence for our claim that
balls-and-bins model II reflects reality with sufficient accuracy, and that the upper bound for the
number of delivery rounds is indeed $O(\lg m)$, independent of the number of processing nodes $n$ of the
fat tree, and independent of the operational phase determined by the number of messages $m$.

We have limited our proof of the time bound to the communication scenario where distinct
message sources are chosen randomly and destinations are chosen randomly and independently.
The rationale behind this choice has been the feasibility of probabilistic analysis. In practice,
many communication patterns can be approximated by this assumption. For other communication
patterns this choice appears to be unreasonable. For example, assume that each of $m$ sources sends
one message to a single destination node $p$. Since $p$ may receive only one message at a time, a lower
bound for the number of delivery rounds is $m$. Because the fat-tree architecture guarantees that
one message will be delivered during a round, $m$ is also the upper bound. This extreme case leads
us to the following conjecture about the number of delivery rounds of any communication pattern
on a fat tree. Assume that $m = m_l + m_r$, where $m_l$ is the number of messages to be transmitted
or delivered sequentially, and $m_r$ is the number of messages whose sources and destinations can be
approximated by a random distribution. Then, the number of delivery rounds is bounded by

\[ O(m_l + \lg m_r). \]

In the extreme case where $m = m_l$, the number of delivery rounds is bounded by $O(m_l)$. In the
other extreme case where $m = m_r$, the number of delivery rounds is bounded by $O(\lg m_r)$ as we
have proved in Section 6. We may view any case between these extremes as a superposition of a
sequential component consisting of $m_l$ messages and a parallel component consisting of $m_r$ messages
with bound $O(m_l + \lg m_r)$ for the number of delivery rounds.

Figure 6: Comparison of the number of delivery rounds for randomly chosen sources and destinations
according to Model II (cf. Figure 4), a transpose permutation (left), and a bit-reversal
permutation (right) with $m = n$ messages. Our simulations show that the number of delivery
rounds for the permutations with $m = n$ is strictly larger than for smaller numbers of messages
$m < n$. The normalized clock-cycle counts show the real behavior of a fat tree with immediate
message retry.

We also ran experiments to evaluate the performance of the fat tree for several regular communication
patterns that are frequently studied in the routing literature [6]. Let $p_1 \ldots p_{\lg n}$ denote the
binary representation of node $p$.

Cyclic-shift permutation: For a given shift $k$, node $p$ sends one message to node $q = (p + k) \bmod n$.
This pattern arises for example in stencil computations over a grid.

Transpose permutation: The node with binary representation $p_1 \ldots p_{(\lg n)/2}\, p_{(\lg n)/2+1} \ldots p_{\lg n}$
sends one message to node $p_{(\lg n)/2+1} \ldots p_{\lg n}\, p_1 \ldots p_{(\lg n)/2}$. The primary application using
this pattern is a matrix transposition.

Bit-reversal permutation: Node $p_1 \ldots p_{\lg n}$ sends one message to node $p_{\lg n} \ldots p_1$. This pattern,
as well as the transpose permutation, is traditionally considered a worst-case routing problem.
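The three patterns can be written as destination functions on node indices. The Python sketch below is an illustration with hypothetical function names; it assumes that $n$ is a power of two, that `bits` equals $\lg n$, and, for the transpose, that `bits` is even so that $n$ is a square.

```python
def cyclic_shift(p, k, n):
    """Destination of node p under a cyclic shift by k among n nodes."""
    return (p + k) % n


def transpose(p, bits):
    """Swap the upper and lower halves of the bits-wide index p (bits even)."""
    half = bits // 2
    low = p & ((1 << half) - 1)
    high = p >> half
    return (low << half) | high


def bit_reversal(p, bits):
    """Reverse the order of the bits of the bits-wide index p."""
    q = 0
    for _ in range(bits):
        q = (q << 1) | (p & 1)  # move the lowest bit of p to the end of q
        p >>= 1
    return q


if __name__ == "__main__":
    bits = 4
    n = 1 << bits
    print([cyclic_shift(p, 3, n) for p in range(n)])
    print([transpose(p, bits) for p in range(n)])
    print([bit_reversal(p, bits) for p in range(n)])
```

Each function maps the node indices onto themselves one-to-one, so every node sends and receives exactly one message when $m = n$.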

Our simulations indicate that the average number of delivery rounds for each of these communication
patterns is bounded by $O(\lg m)$ for $0 < m \le n$. Figures 6 and 7 show the simulation
results for $m = n$ messages. Although we do not present the corresponding graphs, we report that
the number of delivery rounds of the permutations for $m < n$ is strictly less than that shown in
the figures. For the transpose permutation on the left-hand side of Figure 6, we generated data 
points only for those cases where the number of processing nodes n is a square. The normalized 
clock-cycle counts show the behavior of a fat-tree network with immediate message retry after a 
collision. The real performance of a fat tree is on average about a factor of two faster than predicted 
by the round-based model. 

The results of the cyclic-shift permutation in Figure 7 are more comprehensive than those in
Figure 6. Since the number of delivery rounds depends on the shift parameter $k$, we present as the
average number of delivery rounds the average over a range of
shift parameters $0 < k < n$. The minimum and maximum number of delivery rounds of each error
bar represent the corresponding values over all values of $k$.

Figure 7: Comparison of the number of delivery rounds for randomly chosen sources and destinations
according to Model II and a cyclic-shift permutation with $m = n$ messages. Our simulations
show that the number of delivery rounds for the cyclic-shift permutation with $m = n$ is strictly
larger than for smaller numbers of messages $m < n$. The normalized clock-cycle counts show the
real behavior of a fat tree with immediate message retry.

For all three communication patterns, the number of delivery rounds is strictly less than the
average number of rounds required when destination nodes are chosen randomly. These simulation
results suggest that $O(\lg m)$ is the optimal bound not only for randomly chosen sources and
destinations but for many different communication patterns on the fat-tree network.

8 Conclusion 

We have shown that $m \le n$ messages with randomly chosen, distinct sources and independently
and randomly chosen destinations are delivered in a fat-tree network with $n$ processing nodes
within $O(\lg m)$ delivery rounds with high probability. Our proof methodology is based on an
approximating collision model of the messages transmitted into the network. This model constitutes
a tradeoff between simplicity and accuracy. It facilitates a relatively simple probabilistic analysis
and reflects reality with sufficient accuracy at the same time. We have presented empirical evidence
to support our claim that $O(\lg m)$ is a tight upper bound for the delivery of messages not only
under the simplifying assumptions that enable our analysis, but also for practical implementations
and communication scenarios on a fat-tree network.